The Annals of Statistics 

2007, Vol. 35, No. 4, 1351-1377 

DOI: 10.1214/009053606000001460 

© Institute of Mathematical Statistics, 2007 



SIZE, POWER AND FALSE DISCOVERY RATES 

By Bradley Efron 

Stanford University 

Modern scientific technology has provided a new class of large- 
scale simultaneous inference problems, with thousands of hypothe- 
sis tests to consider at the same time. Microarrays epitomize this 
type of technology, but similar situations arise in proteomics, spec- 
troscopy, imaging, and social science surveys. This paper uses false 
discovery rate methods to carry out both size and power calculations 
on large-scale problems. A simple empirical Bayes approach allows 
the false discovery rate (fdr) analysis to proceed with a minimum of 
frequentist or Bayesian modeling assumptions. Closed-form accuracy 
formulas are derived for estimated false discovery rates, and used to 
compare different methodologies: local or tail-area fdr's, theoretical, 
permutation, or empirical null hypothesis estimates. Two microarray 
data sets as well as simulations are used to evaluate the methodology, 
the power diagnostics showing why nonnuU cases might easily fail to 
appear on a list of "significant" discoveries. 

1. Introduction. Large-scale simultaneous hypothesis testing problems, 
with hundreds or thousands of cases considered together, have become a 
familiar feature of current-day statistical practice. Microarray methodology 
spearheaded the production of large-scale data sets, but other "high through- 
put" technologies are emerging, including time of flight spectroscopy, pro- 
teomic devices, flow cytometry and functional magnetic resonance imaging. 

Benjamin! and Hochberg's seminal paper [3] introduced false discovery 
rates (Fdr), a particularly useful new approach to simultaneous testing. Fdr 
theory relies on p-values, that is on null hypothesis tail areas, and as such 
operates as an extension of traditional frequentist hypothesis testing to si- 
multaneous inference, whether involving just a few cases or several thou- 
sand. Large-scale situations, however, permit another approach: empirical 
Bayes methods can bring Bayesian ideas to bear without the need for strong 
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Fig. 1. Histograms of z values from two microarray experiments. Left panel, prostate 
data, comparison of 50 nontumor subjects with 52 tumor patients for each of 6033 genes; 
Singh et al. [31]. Right panel, HIV data, comparison of A HIV negative subjects with 4 
HIV positive patients for 7680 genes; van't Wout et al. [34], discussed in [16]. The central 
peak of the prostate data histogram closely follows the theoretical N{0, 1) null density (solid 
curve), but the HIV histogram is substantially too narrow. Short vertical bars are estimated 
nonnull counts, useful for power calculations, as discussed in Section 3. Estimated null 
proportion po equals 0.93 in both experiments. 



Bayesian or frequentist assumptions. This paper pursues large-scale false 
discovery rate analysis from an empirical Bayes point of view, with the goal 
of providing a versatile methodology for both size and power considerations. 

The left panel of Figure 1 concerns a microarray example we will use 
to introduce our main ideas. 102 microarrays, 50 from nontumor subjects 
and 52 from prostate cancer patients, each measured expression levels for 
the same = 6033 genes. Each gene yielded a two-sample t-statistic tj 
comparing tumor versus nontumor men, which was then transformed to a z 
value, 

(1.1) Zi = ^-\Fioo{U)), 

where Fioo is the cumulative distribution function (c.d.f.) of a Student's t 
distribution with 100 degrees of freedom, and $ is the standard normal c.d.f. 

We expect Zi to have nearly a iV(0, 1) distribution for "null" genes, the 
ones behaving similarly in tumor and nontumor situations. The left his- 
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togram looks promising in this regard: its large central peak, which is nicely 
proportional to a A^(0, 1) density, charts the presumably large majority of 
null genes, while the heavy tails suggest some interesting "nonnull" genes, 
those responding differently in the two situations, the kind the study was 
intended to detect. 

Note. It is not necessary that the Zj's be obtained from t-tests or that 
the individual cases correspond to genes. Each of the cases might involve 
a separate linear regression for example, with the ith case yielding p-value 
Pi for some parameter of interest, and Zi = ^~^{pi). 

Section 2 reviews Fdr theory with an emphasis on the local false discovery 
rate, defined in a Bayesian sense as 

(1.2) fdr(zj) = Probjgene i is null|zj = z}. 

An estimate of Fdr (z) for the prostate data is shown by the solid curve in 
Figure 2, constructed as in Section 2, where it is suggested that a reasonable 
threshold for reporting likely nonnull genes is fdr(2;j) < 0.2. 51 of the 6033 
genes have fdr < 0.2, 25 on the left and 26 on the right. A list of these genes 
could be reported to the investigators with assurances that it contains less 
than 20% null cases. Here fdr methods are being used to control size, or 
Type I errors. 

The solid bars in Figures 1 and 2 are estimates of the nonnull histogram, 
what we would see if we had z values only for the nonnull genes, constructed 
as in Section 3. Combined with the fdr curve, the nonnull histogram helps 
assess power, the ability of the data to identify nonnull genes. Figure 2 
suggests low power for the prostate data: most of the nonnull cases have 
large values of fdr (zi), and cannot be reported on a list of interesting genes 
without also reporting a large percentage of null cases. Section 3 constructs 
some simple power diagnostics based on fdr considerations. 

Following [3], most of the Fdr literature has focussed on tail area false 
discovery rates, 

(1.3) Fdr(zj) = Probjgene i is null|zj < z} 

(or Prob{null|zj > z} depending on the sign of z). Section 2 discusses the 
relationship of fdr to Fdr, with relative efficiency calculations presented in 
Section 5. Local fdr's fit in better with empirical Bayes development, and 
are featured here, but most of the ideas apply just as well to tail area Fdr's. 

The discussion in Sections 2 and 3 assumes that the appropriate null dis- 
tribution is known to the statistician, perhaps being the theoretical A^(0, 1) 
null suggested by (1.1), or its permutation-based equivalent [also nearly 
A^(0, 1) for both data sets in Figure 1]. This is tenable for the prostate 
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Fig. 2. Local false discovery rate fdr(2;), (1.2) for prostate data, solid curve. 51 genes, 
25 on the left and 26 on the right, have fdr(zi) < 0.2, a reasonable threshold for reporting 
nonnull cases. Solid bars show estimated nonnull histogram (plotted negatively, divided by 
50^, constructed as in Section 3. Most of the nonnull cases will not be reported. 



data, but not for the HIV data. Sections 4 and 5 consider the more difficult 
and common large-scale testing situation where there is evidence against 
the theoretical null. Efron [8, 9, 10] discusses estimating an empirical null 
in situations like that for the HIV data where the central histogram does 
not match A^(0, 1). Some methodology for constructing empirical nulls is 
described in Section 4, and its efficiency investigated in Section 5. [It gives 
empirical null A^(— 0.011, 0.75^) for the HIV data, as shown in Figure 1.] 
Three pairs of related ideas are considered here: 

• Size and power calculations for large-scale simultaneous testing. 

• Local and tail-area false discovery rates. 

• Theoretical and empirical null hypotheses. 

All combinations are possible, a power analysis using local fdr with a the- 
oretical null distribution for instance, but only a few are illustrated in the 
text. 

A substantial microarray statistics literature has developed in the past 
few years, much of it focused on the control of frequentist Type I errors; see, 
for example, [7], and the review article by Dudoit, Shaffer and Boldruck [6]. 
Bayes and empirical Bayes methods have also been advocated, as in [18, 19] 
and [27], while Benjamin! and Hochberg's Fdr theory is increasingly influ- 
ential; see [15] and [33]. Lee et al. [22] and Kerr, Martin and Churchill [20] 
discuss large-scale inference from ANOVA viewpoints. Local fdr methods. 
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which this article argues can play a useful role, were introduced in [14]; sev- 
eral references are listed at the end of Section 2. The paper ends with a brief 
summary in Section 6. 

2. False discovery rates. Local false discovery rates [13, 14], are a variant 
of [3] "tail area" false discovery rates. This section relates the two ideas, 
reviews a few basic properties, and presents some general guidelines for 
interpreting fdr's. 

Suppose we have N null hypotheses to consider simultaneously, each with 
its own test statistic. 

Null hypothesis: Hi,H2, . . . , Hi, . . . , H^, 

(2.1) 

Test statistic : zi, Z2, ■ ■ ■ , Zi, . . . , z^; 

N must be large for local fdr calculations, at least in the hundreds, but the Zi 
need not be independent. (At least not for getting consistent Fdr estimates, 
though correlations can decrease the accuracy of such estimates, as detailed 
in Section 5.) A simple Bayesian "two-class" model, [14, 22, 26], underlies 
the theory: we assume that the cases are divided into two classes, null or 
nonnull, occurring with prior probabilities po or pi = 1 — po, and with the 
density of the test statistic z depending upon its class, 

Po = Prjnull}, foi^) density if null, 

(2.2) 

pi = Prjnonnull}, fi{z) density if nonnull. 

It is natural to take foiz) to be a standard A(0, 1) density in context (1.1), 
the theoretical null. Here and in Section 3 we assume that fo{z) is known 
to the statistician, deferring until Section 4 its estimation in situations like 
that for the HIV data where the theoretical null is not believable. Fdr theory 
does not require specification of fi{z), which is only assumed to be longer- 
tailed than foiz), with the nonnull Zj's tending to occur farther away from 
0. Proportion po, the Bayes a priori probability of a gene being null, is also 
supposed known here, its estimation being discussed in Sections 4 and 5. 
Practical applications of large-scale testing usually assume po large, say 

(2.3) PO > 0.9, 

the goal being to identify a relatively small set of interesting nonnull cases. 
Under assumption (2.3), po has little practical effect on the usual false dis- 
covery rate calculations, that is, on the control of Type I errors, but it will 
become more crucial for the power diagnostics of Section 3. 
Define the null subdensity 



(2.4) 



foiz) =pofoiz) 
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Fig. 3. Geometrical relationship of Fdr to fdr; heavy curve plots Fq'{z) versus F{z); 
fdr(z) is slope of tangent, Fdr(2) slope of secant. 



and the mixture density 

(2.5) fiz)=Pofo{z)+PiMz). 

The Bayes posterior probabihty that a case is null given z, by definition the 
local false discovery rate, is 

(2.6) fdr(z) ^ Pr{null|z} = Pofoiz)/ f{z) = f^{z)/f{z). 

The Benjamini-Hochberg false discovery rate theory relies on tail areas 
rather than densities. Letting Fq{z) and Fi{z) be the c.d.f.'s corresponding 
to fo{z) and h{z) in (2.2), define F^{z) = poFo{z) and F{z) =pqFo{z) + 
piFi{z). Then the posterior probability of a case being null given that its 
z-value "Z" is less than some value z is 

(2.7) Fdr(z) = Pr{null|Z < z] = F+{z)/F{z). 

(It is notationally convenient to consider events Z < z but we could just 
as well consider tail areas to the right, two-tailed events, etc.) Figure 3 
illustrates the geometrical relationship between Fdr and fdr. 
Analytically, Fdr is a conditional expectation of fdr [13], 

Fdr(z)=r fdr{Z)f{Z)dz/ r f{Z)dZ 

J —oo ' J —oo 

(2.8) 

= Ef{idT{Z)\Z <z}, 

indicating expectation with respect to f{z) [13]. That is, Fdr(z) is 
the average of fdr(Z) for Z < z; Fdr(2;) will be less than fdr(z) in the usual 
situation where fdr(2;) decreases as \z\ gets large. For example fdr(— 3.39) = 
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0.20 in Figure 2 while Fdr(-3.39) = 0.105. If the c.d.f.'s Fq{z) and ^1(2:) are 
Lehmann alternatives, 

(2.9) Fi(z) = Fo(z)° [a<l], 

it is straightforward to show that 

, r fdr(z) 1 r Fdr(z) 1 , /I 



l-fdr(z)J ^U-Fdr(z)J 
giving 

(2.11) fdr(^) =Fdr(z)/a 

for small values of Fdr. The prostate data of Figure 1 has a roughly 1/2 in 
each tail. 

Benjamini and Hochberg's [3] Fdr control rule depends on an estimated 
version of (2.7) where F(z) is replaced by the empirical c.d.f. "F" of the z 
values, 

(2.12) Fdr(z) =poi^o(^)/i^o(^) [F{z) = #{zi < z}/N]. 

Storey [32] and Efron and Tibshirani [13] discuss the connection of the fre- 
quentist Fdr control procedure with Bayesian form (2.7). Fdr(z) corresponds 
to Storey's "g-value," the tail-area false discovery rate attained at a given 
observed value Zi = z. Fdr(z) is biased upward as an estimate of Fdr(z); see 
Section 4 of [13]. 

The estimated fdr curve in Figure 2 is 

(2.13) fdr(z)=po/o(^)//(^), 

where fo{z) is the standard normal density ip{z) = exp{— z^/2}/-v/2x, po = 
0.932 is the value derived in Section 4, and f{z) is a maximum likelihood es- 
timate (MLE) of the mixture density f{z), (2.5), within the seven-parameter 
exponential family described in Section 4. This type of flexible parametric 
modeling takes advantage of the fact that f{z), as a mixture of null and 
nonnull z values, tends to be quite smooth; see Section 6 of [9]. Lindsey's 
method, Lindsey [24, 25] described in Section 4, finds f{z) using standard 
Poisson GLM software. The theory and simulations of Section 5 show only 
a moderate cost in estimation variability for fdr compared to Fdr. 

A variety of other local fdr estimation methods have been suggested: us- 
ing more specific parametric models such as normal mixtures, see [1, 28, 30] 
or [17]; isotonic regression [4]; local smoothing [2]; and hierarchical Bayes 
analyses [5, 23]. All seem to perform reasonably well. The Poisson GLM 
methodology of this paper has the advantage of easy implementation with 
familiar software, and permits a closed-form error analysis as shown in Sec- 
tion 5. 
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Classical frequentist hypothesis testing methods rely on tail areas by ne- 
cessity. Large-scale testing situations allow us to do local calculations, which 
are more natural from a Bayesian point of view. For example, the 25 prostate 
data genes having zi < —3.39 have g-value Fdr(— 3.39) = 0.105; they have 
average fdr(zj) of about 0.105 [as in (2.8)], but varying from 0.20 at the 
boundary point Zi = —3.39 down to idT{zi) = 0.01 at Zi = —4.4. This just 
says the obvious, that Zj's further from the boundary are less likely to be 
false discoveries, which is the useful message conveyed by fdr(z). The power 
diagnostics of Section 3 rely on the local Bayesian interpretation (2.6). 

The literature has not reached consensus on a standard choice of q for 
Benjamini-Hochberg testing, the equivalent of 0.05 for single tests, but 
Bayesian calculations offer some insight. The cutoff threshold fdr < 0.20 
used in Figure 2 yields posterior odds ratio 

Pr{nonnull|2}/Pr{nun|z} = (1 - fdr(z))/fdr(z) 

(2.14) 

= Pifi{z)/pofo{z)> 0.8/0.2 = A. 

If we assume prior odds ratio pi/po < 0.1/0.9 as in (2.3), then (2.12) corre- 
sponds to the Bayes factor 

(2.15) /i(2)//o(^)>36 

in favor of nonnull. 

This threshold requires a much stronger level of evidence against the null 
hypothesis then in standard one-at-a-time testing, where the critical thresh- 
old lies somewhere near 3 [11]. We might justify (2.15) as being conservative 
in guarding against multiple testing fallacies. More pragmatically, increasing 
the fdr threshold much above 0.20 can deliver unacceptably high proportions 
of false discoveries to the investigators. The 0.20 threshold, used in the re- 
mainder of the paper, corresponds to g-values between 0.05 and 0.15 for 
reasonable choices of a in (2.11); such g-value thresholds can be interpreted 
as reflecting a conservative Bayes factor for Fdr interpretation. 

Any choice of threshold is liable to leave investigators complaining that 
the statisticians' list of nonnull cases omits some of their a priori favorites. 
Conveying the full list of values idic{zi), not just those for cases judged non- 
null, allows investigators to employ their own prior opinions on interpreting 
significance. This is particularly important for low-powered situations like 
the prostate study, where luck plays a big role in any one case's results, but 
it is the counsel of perfection, and most investigators will require some sort 
of reduced list. 

The basic false discovery rate idea is admirably simple, and in fact does 
not depend on the literal validity of the two-class model (2.2). Consider the 
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28 genes in the prostate example that have zi > 3.3; the expected number 
of null genes having Zi > 3.3 is 2.71 [= 6033 • 0.932(1 - ^>(3.3))], so 

(2.16) Fdr = 2.71/28 = 0.097. 

The Fdr interpretation is that about one tenth of the 28 genes can be ex- 
pected to be null, the other nine tenths being genuine nonnull discoveries. 

This interpretation does not require independence, nor even all of the null 
genes to have the same density /o(^), only that their average density behaves 
like /q. Since the denominator 28 is observed, the nonnull density fi{z) plays 
no role. Exchangeability of the 28 cases is the only real assumption, coming 
into play when we report that each of the 28 genes has the same one tenth 
probability of being null. The local fdr has an advantage here, since the 
equivalent exchangeability assumption is made only for genes having the 
same observed z values. These ideas are examined in Section 4 of [13]. 

3. Power diagnostics. The microarray statistics literature has focussed 
on controlling Type I error, the false rejection of genuinely null cases. Dudoit, 
van der Laan and Pollard [7] provides a good review. Local fdr methods 
can also help assess power, the probability of rejecting genuinely nonnull 
cases. This section discusses power diagnostics based on fdr(z), showing, 
for example, why the prostate study might easily fail to identify important 
genes. The emphasis here is on diagnostic statistics that are dependable and 
simple to calculate. 

The nonnull density fi{z) in the two-class model (2.2), unimportant for 
the "size" calculations of Section 2, plays a central role in assessing power. 
From (2.5) and (2.6) we obtain 

/oo 
[l-idT{z)\f{z)dz = l-po 
-oo 

and 

(3.2) /i(z) = (l-fdr(z))/(z)M. 

An estimate of f{z) gives idi{z) as in (2.13), and then the estimated nonnull 
density 

(3.3) Mz) = [1 - fdr(z)]/(z) / / [1 - fdr(z')]/(^') dz' . 

' J— oo 

Power diagnostics are obtained from the comparison of fi{z) with fdr(2;). 
The expectation of fdr under /i, say "-Efdri," provides a particularly simple 
diagnostic statistic, 

^-^—^ /"OO ^ ^ roo ^ 

(3.4) Eidv= fdr(z)[l -fdr(z)]/(z)dz/ / [1 - fdr(z)]/(z) dz. 

J —oo J —oo 
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A small value of E'fdri, perhaps ii^fdri = 0.20, suggests good power, with 
a typical nonnull gene likely to show up on a list of interesting candidates 
for further study. Neither of the examples in Figure 1 demonstrates good 
power; £^fdri = 0.68 for the prostate data and 0.47 for the HIV data (the 
latter based on the empirical null fdr estimate of Section 4). 

The nonnull counts pictured in Figures 1 and 2 allow a more intuitive in- 
terpretation of formula (3.4). Suppose that the N z- values have been placed 
into K bins of equal width A, with 



(3.5) 



Xk = centerpoint of kth bin for k = 1,2, . . . ,K, 
Uk = #{zi in fcth bin}. 
Since 

(3.6) Probjgene i nonnulllzj = z} = 1 — fdr(z), 

an approximately unbiased estimate of the nonnull counts in bin k is 

(3.7) yik = [1 - fdr(a;fe)] • y^. 

The solid bars in Figures 1 and 2 follow definition (3.7), except with 
replaced by a smoothed estimate proportioned to f{xk)- Looking at Figure 2, 
an obvious estimate of the nonnull expectation E'fdri is 

(3 8) Eidv - ^k^'^^ixk)-yik ^ Ek^'^^ixk)[l-idr{xk)]f{xk) 
EkVik Efc[l-fdr(xfc)]/(^fc) 

which amounts to evaluating the integrals in (3.4) via the trapezoid rule. 
Table 1 reports on a simulation study of E'fdri. The study took 

/I n^ ATf 1 \ -+1, / = 0' probability 0.90, 

(3.9) z, ~ iV(/.„ 1) with I ^ ^^3^ probability 0.10, 

for i = l,2,...,iV = 1500. [More precisely, m = 3 + <l>-\{i - 0.5)/150), i = 
1,2,..., 150, for the nonnull cases.] The "theoretical null" columns assume 
/o = A^(0, 1), while "empirical null" estimates /o by the central matching 
method of Section 4. Both methods estimated pi = 1 —po by central matches. 
The true value of £^fdri in situation (3.9) is 0.32. The estimates Efdri are 
seen to be reasonably stable and roughly accurate. Section 5 discusses the 
downward bias in pi . 

Going further, we can examine the entire distribution of fdr under /i 
rather than just its expectation. The nonnull c.d.f. of fdr is estimated by 

(3.10) Gi{t)= yifc/E^ifc 

fc:fdr(a;fe)<i 
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Table 1 

Means, standard deviations and coefficients of variation of EiAii (3.5); 100 trials of 
situation (3.9), = 1500. True value Mdn = 0.32, pi = 0.10 





Theoretical null 


Empirical null 






Pi 




Pi 


Mean 


0.285 


0.085 


0.232 


0.076 


Stdev 


0.060 


0.015 


0.040 


0.011 


Coefvar 


0.21 


0.18 


0.17 


0.14 



for < t < 1. Figure 3 shows Gi{t) for the prostate study, for the HIV study 
[taking /o = A^(— 0.11, 0.75^), po = 0.93, as in Figure 1], and for the first of 
the 100 simulations from model (3.9). The simulation curve suggests good 
power characteristics, with 64% of the nonnull genes having fdr less than 0.2. 
At the opposite extreme, only 11% of nonnull genes in the prostate study 
have fdr less than 0.2. 

Graphs such as Figure 4 help answer the researchers' painful question 
"why are not the genes we expected on your list of nonnull outcomes?" For 
the prostate data, most of the nonnull genes will not turn up on a list of 
low fdr cases. The R program locfdr, discussed in Section 4, returns E'fdri 
and a graph of Gi{t). 

Traditional sample size calculations employ preliminary data to predict 
how large an experiment might be required for effective power. Here we 
might ask, for instance, if doubling the number of subjects in the prostate 
study would substantially improve its detection rate. 

To answer this question, denote the mean and variance of Zi by //j and 

(3.11) Zi'^ {fii,af). 

We imagine that c independent replicates of Zi are available for each gene 
(doubling the experiment corresponding to c = 2), from which a combined 
test statistic 2j is formed, 

c 

(3.12) Zi = '^Zij/^/c'^ {^/c^i,af). 

i=i 

This definition maintains the mean and variance of null cases, Zi ~ (0,cr^), 
while moving the nonnull means Jli = away from zero by the factor ^/c. 
Consider a subset of m nonnull genes, say 5, and define 



(3.13) /i = ^/ii/m, =^(/ii -/i)^/m and = ^ollm. 

s s s 



12 



B. EFRON 




dA 0£ a.4 0.6 9A 1.0 



Idr value 

Fig. 4. Estimated nonnull c.d.f. of fdr, (3.10); prostate study, HIV study, and first of 
100 simulations, model (3.9). The simulation curve suggests substantial power, with 64% 
of the nonnull cases having fdr less than 0.2. _Efdri values: 0.23 (simulation), 0.45 (HIV), 
0.68 (prostate). 

A randomly selected Zi value ".Z^" from S has mean and variance 

(3.14) Z~(/Z,A2 + a2), 

while Z, the corresponding randomly selected 2j value, has 

(3.15) Zr^{^/Zfi,cA'^ + a^), 
so 

(3.16) Z = y/Z jl + d{Z - fl) [d^ = c-{c-l)a^/{A^ + a'^)] 

gives Z the correct mean and variance. 

Let S be the set of nonnull genes having Zi> 0. We can estimate p, and 
d from the corresponding nonnull counts yi^ (the bars to the right of z = 
in Figure 2), and then use (3.16) to move those counts out from location 
to Xk = \/c{p, + d^/^(xfc — /i)}. The null counts yofc = yk — Vik do not change 
location when the sample size increases. We can do the same calculations 
for the nonnull counts having Zi < 0. Together, these provide an estimate of 
what the entire z-value histogram would look like if the sample size were 
increased by the factor c, from which we can recalculate E'fdri or any other 
diagnostic statistic. 

Table 2 shows £^fdri estimates for hypothetical expansions of the prostate 
and HIV studies. Doubling the HIV study, to 8 instead of 4 subjects in each 
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Table 2 

Hypothetical values of i?fdri for versions of the prostate and HIV studies expanded by 
factor c; based on transformation (3.17) for the nonnull counts, as calculated by R 

program locfdr 



c 


1 


1.5 


2 


2.5 


3 


Prostate 


0.68 


0.54 


0.44 


0.38 


0.34 


HIV 


0.45 


0.31 


0.23 


0.18 


0.14 



HIV category, reduces ii^fdri from 0.45 to 0.23, while doubling the prostate 
study gives less dramatic improvement. Table 2 is based on a cruder version 
of (3.16) that takes d=l, 

(3.17) Z = y/ZZ, 

in other words, simply moving the nonnull counts from Xk to \fcx\^. 

Using (3.17) tends to underestimate the reduction in i^fdri, but did not 
make much difference in this case. 

The R program locfdr does these calculations. They have a speculative 
nature, but no more so than traditional power projections. Like all of the di- 
agnostics in this section, they require no mathematical assumptions beyond 
the original two-class model (2.2). 

4. Empirical null estimation. The null density fo{z) in (2.2) is crucial to 
false discovery rate calculations, or for that matter to any hypotheses test- 
ing method. We assumed /o ~ -/V(0, 1), the theoretical null, for the prostate 
data. This seems natural in situation (1.1), being almost certainly what 
would be done if there were only a single gene's data to analyze. Large scale 
simultaneous testing, however, raises the possibility of detecting deficien- 
cies in the theoretical null, as with the HIV data in Figure 1 where the 
z-value histogram is noticeably too narrow around its central peak. This 
section concerns data-based estimates of fo{z), for example, the empirical 
null distribution /q ~ -^(~0.11, 0.75^) for the HIV data, shown in Figure 1. 

Efron [8, 10] lists four reasons why /q might differ from the theoretical 
null: 

(1) Failed assumptions. Let Y be the N hy n matrix of expression levels, 
N genes and n microarrays in our two studies, 

(4.1) yij = expression level for ith gene and jth array. 

The HIV study has N = 7680 genes and n = 8 microarrays, 4 each from 
HIV positive and HIV negative subjects. Each gene yields a two-sample t 
statistic ti comparing positive versus negative subjects, with z-value 

(4.2) z, = ^-\Fe{t,)), 
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Fq the c.d.f. of a t distribution having 6 degrees of freedom. 

The theoretical null distribution /o ~ iV(0, 1) is justified for (4.2) if the 
yjj's are normal, or by asymptotic theory as n goes to infinity, neither ar- 
gument applying to the HIV data. We can avoid these assumptions by com- 
puting a permutation null, the marginal distribution of the Zj's obtained by 
permuting the columns of Y. This gave /o~A^(0, 0.99^) for the HIV data, 
failing to explain the narrow central peak in Figure 1. 

(2) Unobserved covariates. The HIV study is observational: subjects were 
observed, not assigned to be HIV positive or negative, and similarly for 
the prostate study. Section 4 of [8] discusses how unobserved covariates in 
observational studies are likely to widen fo{z), and how this effect is not 
detectable by permutation analysis. A microarray example is presented in 
which the z-value histogram has central dispersion more than half again as 
wide as the theoretical null. Since the HIV histogram is too narrow at its 
center rather than too wide, unobserved covariates are not the problem here. 

(3) Correlation across arrays. The theoretical null as applied to (4.2) or 
(1.1) assumes independence across the columns of Y, that is, among yii,yi2, 
...,yin for gene i. Experimental difficulties can undercut independence in 
microarray studies, while being undetectable in the permutation null distri- 
bution. The HIV data was analyzed with the HIV negative subjects as the 
first four columns of Y and the positives as the last four columns. A prin- 
cipal components analysis suggested a strong pattern of correlation across 
columns, with arrays (1,3,5,7) positively correlated, and likewise arrays 
(2, 4, 6, 8). This pattern would narrow the null distribution in situation (4.2). 

(4) Correlation between genes. A striking advantage of the two-group 
model and its false discovery rate analysis in Section 2 is that it does not re- 
quire independence between genes. Estimates such as fdr(2;) =pofo{z)/f{z) 
only require consistency for f{z) (but do not achieve the full efficiency at- 
tainable from knowledge of the gene- wise correlation structure) . 

Efron [10] emphasizes a caveat to this argument: even if the theoretical 
null is individually appropriate for each null gene, correlations between genes 
can make the effective null distribution fo{z) substantially narrower or wider 
than A'^(0, 1). There it is shown that the amount of correlation in the HIV 
data could easily explain a A^(— 0.11, 0.75^) null distribution. (By contrast, 
the prostate data exhibits quite small gene- wise correlations.) A permutation 
null distribution will not reveal correlation effects. 

Empirically estimating the null distribution avoids all four difficulties, and 
any others that may distort /q. There is a price to pay, though, in terms of 
accuracy: using the empirical null substantially increases the variability of 
estimated false discovery rates, as shown in Section 5. This price is unavoid- 
able in situations like the HIV study where there is clear evidence against 
the theoretical null; the null distribution provides the crucial numerator in 
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false discovery rate estimates like (2.16), where using an inappropriate null 
undercuts inferential validity. (Using the theoretical null on the HIV data 
eliminates all but 20 of the 151 genes having empirical null fdr estimate less 
than 0.20, including all of those with Zi < 0.) 

The basic empirical null idea is simple: we assume /o(-z) is normal but 
not necessarily with mean and variance 1, say 

(4.3) /o(z)~iV(<5o,a2), 

and then estimate 5o,o"o, as well as the null proportion po in (2.2), from the 
histogram data near 2 = 0. Two different methods for estimating (5o)0"0)Po) 
will be described, and their accuracies analyzed in Section 5. Both methods 
are implemented in algorithm locfdr, available through the Comprehensive 
R Archive Network, http://cran.r-project.org; locfdr produced all of 
the numerical examples in this paper. 

"Central matching," the first of our two estimation methods for {So,aQ,po), 
operates by quadratically approximating log f{z) around z = 0. To begin, the 
locfdr algorithm estimates f{z), (2.5), by maximum likelihood fitting to the 
histogram counts yk for the z values, (3.5), within a parametric exponential 
family. Figure 2 used the seven-parameter family 

(4.4) ff3{z) = Cf3e^pi^J2(3,z^^, 

Cf3 the constant making integrate to 1. 

Figure 5 illustrates central matching estimation of ((5o,o"o,po) for the HIV 
data based on the methodology in [8]. The heavy curve is log/(2;), fit by 
maximum likelihood [using a natural spline basis with 7 degrees of freedom, 
rather than the polynomials of (4.4), another option in locfdr, though (4.4) 
gives nearly identical results in this case]. A quadratic curve foiz) has been 
fit to log/(2;) around z = 0, 

(4.5) log{f+{z))=Po + Piz + P2Z^. 
Assuming /o (z) ~ A^((5o, ctq)' the log of the null sub density (2.4) is 

(4.6) log(/o+(z)) = logpo + log(2vra2) j + ^z- ^z\ 

Estimates (/?o, /32) from (8.5) translate to estimates {5o,dQ,pQ) in (4.6), 
for example, ctq = (2/32)~^^^- For the HIV data this gave 

(4.7) 5o = -0.107, ao = 0.753 and po = 0.931. 

The logic here is straightforward: we make the "zero assumption" that 
the central peak of the z-value histogram consists mainly of null cases, 
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Fig. 5. Central matching estimation of po and fo{z) ~ A/'((5o,ao) /o*" ^he HIV data; heavy 
curve is log of f{z), estimated mixture density (2.5); beaded curve is quadratic fit to log f{z) 
around 2 = 0, estimating log f^{z), (2.4). The three estimated coefficients of quadratic fit 
give {5o,ao,po)- 

and choose {So,ao,po) in (4.6) to quadratically approximate the histogram 
counts near 6 = 0. Some form of the zero assumption is required because the 
two-class model (2.2) is unidentifiable in the absence of strong parametric 
assumptions on fi. 

A healthy literature exists on estimating pQ, as in [21] and [29], all of 
which relies on the zero assumption [mostly working with p-values rather 
than z-values, e.g., pi = FQ{ti) in (4.2), where the "zero region" occurs near 
p = l]. All of this literature relies on the validity of the theoretical null, so in 
this sense (4.5) and (4.6) is a straightforward extension to situations where 
the theoretical null is untrustworthy. For the HIV data, using the theoretical 
null in (4.5) and (4.6), that is, taking (/3i,/32) equal (0,1/2), results in the 
impossible estimate po = 1.18. This will always happen when the z-value 
histogram is narrower than A^(0, 1) near z = 0. 

The zero assumption is more believable ii po, the proportion of null cases, 
is large. Efron [8] shows that if pq exceeds 0.90 the fitting method in Figure 
5 will be nearly unbiased: although the 10% or less of nonnull cases might 
in fact contribute some counts near z = 0, they cannot substantially affect 
6o and ctq; the po estimate is affected, being upwardly biased, as seen in 
Table 1. 

"MLE fitting," the second, newer, method of estimating {6o,aQ,po), is 
based on a truncated normal model. We assume that the nonnull density is 
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supported outside some known interval [— xo,xo], 
(4.8) fiiz)=0 for [-xo,xo]. 

We need the following definitions: 

Io = {i:Zi£ [-xo,xo]}, 
Nq = number of Zi G [— xo,xo], 

(4.9) 

zo = {zi,te Jo}, 

and 

(4.10) ^So,.oi^) = ^l=exp|-^(^)\ 
Then 

(4.11) e = poHo{6o,(Jo) = PToh{zi £ [-xo,xo]} 

under model (2.2). 

The likelihood function of the data {N,zo) is 

(4.12) /,.,.„,,„(iV,zo) = [0^°(l-0)^-^°] 

This is a product of two exponential family likelihoods, as discussed in Sec- 
tion 5. It is easy to numerically find the MLE estimates {5Q,ao,0) in (4.12), 
after which 

(4.13) po = 0/Ho{So,ao). 

Table 3 compares the estimates ((5o, ctqjPo) obtained from central matching 
and MLE fitting for the same 100 simulations of model (3.9) used in Table 1. 
MLE fitting does somewhat better overall, especially for 6o, the mean of /q. 
The results are encouraging, in particular showing that (Tq can be estimated 
within a few per cent. Delta method formulas for the standard deviations 
are developed in Section 5. These performed well, giving nearly the correct 
average values, as shown in the table, with small coefficient of variation 
across the 100 simulations, about 10% for central matching and 3% for 
MLE fitting. Changing sample size N = 1500 by multiple "c" changes the 
standard deviations by about l/\/c. 

The Zi's are independent in model (3.9). This is unrealistic for microarray 
studies, but as discussed in Section 5, the results may still be applicable to 
highly correlated situations. 



n 



Ho{6o,ao) 
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Table 3 

Comparison of estimates ((5o, S^OiPo), central matching and MLE fitting; 100 simulations, 

model (3.9), as in Table 1. MLE fitting took xo=2 in (4.8); ''formula" standard 
deviations from delta method calculations, Section 5. True values {5o,ao,po) = (0, 1,0.9) 



Central matching MLE fitting 





mean 


stdev 


(formula) 


mean 


stdev 


(formula) 




0.021 


0.056 


(0.062) 


0.044 


0.031 


(0.032) 


So 


1.020 


0.029 


(0.033) 


1.035 


0.031 


(0.031) 


Po 


0.924 


0.013 


(0.015) 


0.933 


0.009 


(0.011) 



The two fitting methods have different virtues and defects. Central match- 
ing is attractive from a theoretical point of view, suggesting how we might 
go beyond normality assumption (4.3), as discussed in Section 7 of [9]. As 
mentioned before, it gives nearly unbiased estimates of 6o and ao if po ex- 
ceeds 0.9. However, it can be excessively variable, especially in estimating 
6q, and is sensitive to the range of discretization (though not the grid size 
A) in (3.5); reducing the range of rr^ in Table 3 from [—4,7.4] to [—4,6.1] 
gave notably worse performance. 

MLE fitting generally gives more stable parameter estimates, for reasons 
suggested by the influence function analysis of Section 5. It does not require 
discretization of the z- values. It does, however, depend strongly on the choice 
of xo in (4.8), which was arbitrarily set at xq = 2 in the simulations. A more 
adaptive version that began by estimating an appropriate "zero assumption" 
interval is feasible but more variable. This contrasts with central matching, 
which automatically adjusts to each situation, including ones where {6o,(7q) 
is far from (0, 1). 

Locfdr defaults to MLE fitting for fdr estimation, but also returns the 
central matching estimates. The two methods gave similar results for the 
HIV data, (?o,5o,Po) = (-0.117,0.785,0.955) for MLE fitting, compared to 
(4.7). 

Accurate estimation of {5o,aQ,po) is just as important for tail-area Fdr 
analysis (2.7) as for the local version (2.6). Section 5 computes the accuracy 
of both Fdr(z) and fdr(z). Using an empirical null is expensive in either 
venue, but the theoretical null can be an unrealistic option, as for the HIV 
data. 

Permutation methods are popular in the microarray literature, but they 
only address the first of our four listed difficulties for the theoretical null; in 
practice permutation null distributions are usually close to the theoretical 
distribution, especially for t-test statistics. Permutation and empirical null 
methods can be used together: if F{t) is the permutation c.d.f. for the t- 
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statistics in the HIV study, then we could set = rather than 

(4.2) to begin empirical null estimation. 

5. Influence and accuracy. Two different classifications of false discov- 
ery rate methods have been discussed: local versus tail area definitions, Sec- 
tion 2, and theoretical versus empirical null estimates. Section 4. This section 
derives accuracy formulas for all four combinations, based on closed-form in- 
fluence functions. In Figure 2, for example, the R algorithm locfdr reports 
fdr(3.37) = 0.2 lb 0.02 for the theoretical null and local fdr combinations, the 
standard error 0.02 coming from Theorem 1 below. The influence functions 
also help explicate differences between central matching and MLE fitting 
estimation for the empirical null. 

For numerical calculations it is convenient to assume that the z-values 
have been binned as in (3.5): into K bins of width A, centerpoint x^, for 
k = 1,2, . . . ,K , with the count in bin k. "Lindsey's method," as discussed 
in Section 2 of [12], then permits almost fully efficient parametric density 
estimation within exponential families such as (4.4), using standard Poisson 
regression software. ^ 

Locfdr first fits an estimated mixture density f{z), (2.5), to the count vec- 
tor y = (2/1,2/2) • • • 5 Uk)' , by maximum likelihood estimation within a para- 
metric family such as (4.4). For central matching, log/Q'"(z), (2.4), is fit to 
log f{z) as in (4.5)-(4.7); the fitting is by ordinary least squares over a central 
subset of Kq bins having index set say "Iq-" In Figure 5, f{z) was estimated 
using K = Al bins having centerpoints — 4.0, — 3.8, . . . , 4.0, while log/,^(2:) 
was fit to log/(z) from the i^o = 6 central bins io = (18, 19, 20, 21, 22, 23). 

Let X be the K x m structure matrix used for estimating log/(z); X 
has m = 8, kth row (1, Xfc, . . . , x^) in (4.4). Also let Xq be the K x tuq 
matrix used to describe log/o(2;); Xq has kth row (l,Xfc,x|), uiq = 3 for the 
empirical null estimate (4.5), while Xq is the K x I matrix (1,1,..., 1)' for 
the theoretical null. (Section 7 of [9] considers more ambitious empirical null 
estimates, e.g., including a cubic term.) 

Define submatrices of X and Xq, 

(5.1) X = X[io,-] and Xo = Xo[io,-], 
of dimensions Kq x m and Kq x mo; also 

(5.2) Vk = NAf{xk), k = l,2,...,K, 
an estimate of the expected count in bin k; and 

(5.3) G = X' diag(P)X, Gq = XqXq, 

where diag(;?) is a x AT diagonal matrix having diagonal elements Vk- 
Finally, let £ indicate the iT- vector with elements £k = log/(xfc), likewise £q 
for the vector (log/J^(xfc)) and ifdr for (log fdr (x^)). 
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By definition the influence function of vector £fdr with respect^to count 
vector y is the K x K matrix Mfdr/dy of partial derivatives diidTk/dyi- 

Lemma 1. The influence function of logfdr with respect to y, when 
using central matching, is 

(5.4) ^ = AG~'X', 

dy 

where 

(5.5) A = XoGq^XqX - X. 

Proof. A small change dy in the count vector (considered as continu- 
ous) produces the change di in £, 

(5.6) di = XG-^X'dy. 

Similarly if £^ = Xq^,j an mo-vector, is fit by least squares to £ = £[io], we 
have 

(5.7) d7 = Go^Xod£ and d?o = XqCq ^Xq d£. 

Both (5.6) and (5.7) are standard regression results. Then (5.6) gives di = 
di[io]=XG-^X'dy, yielding 

dlQ = XoGQ^X'^XG'^X'dy 

from (5.7). Finally, 

(5.8) d{mr) = d(£o -£) = {XoGq^X'^X - X)G-^X' dy, 
verifying (5.4). □ 

Theorem 1. In the case where the z values arejndependent, the delta- 
method estimate of covariance for the vector o/ logfdr(xjfc) values, based on 
central matching, is 

(5.9) cdv{£Mr)=AG-^A'. 

Proof. Under independence, y has a multinomial distribution with co- 
variance matrix 

(5.10) cov(y) = diag(iy) - viy'/N, 

where v = E{y} [uk = N /\f[xk), as in (5.2)]. The delta-method covariance 
estimate is 



(5.11) 



(^-^jcov(y)(^-^j ={AG ^X)dia^{v){AG ^X)' 



AG-^A'. 
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^ Table 4 

Boldface: standard errors o/Iogfdr(z), ('^ local" ), anrf log Fdr(2;), ('HaiF), 250 
replications of model (3.9), A'^ = 1500. Parentheses; average standard deviation estimate 

from formula (5.9); fdr is the true false discovery rate (2.6) based on model (3.9). 
Natural spline basis, 7 degrees of freedom used to fit f{z), central matching for fg{z), 

empirical null case 



z 


fdr 




Theoretical null 






Empirical null 




local 


(formula) 


tail 


local 


(formula) 


tail 


1.5 


0.88 


0.05 


(0.05) 


0.05 


0.04 


(0.04) 


0.10 


2.0 


0.69 


0.08 


(0.09) 


0.05 


0.09 


(0.10) 


0.15 


2.5 


0.38 


0.09 


(0.10) 


0.05 


0.16 


(0.16) 


0.23 


3.0 


0.12 


0.08 


(0.10) 


0.06 


0.25 


(0.25) 


0.32 


3.5 


0.03 


0.10 


(0.13) 


0.07 


0.38 


(0.38) 


0.42 


4.0 


0.005 


0.11 


(0.15) 


0.10 


0.50 


(0.51) 


0.52 



Here we have used {d£fdr / dy)V = by homogeneity. As discussed below, 
formula (5.9) also has some application to the situation where the z values 
are correlated. □ 

Note, y is an approximation to the order statistic of the z values, 
exactly the order statistic if we let bin width A ^ 0. False discovery rates 
only depend upon the order statistic, facilitating compact formulas like (5.9). 

A formula similar to (5.11) exists for the tail area false discovery rates 
^Fdrfc = logFdr(xfc), 

(5.12) cdv{£Fdr)=BG-^B', 

(5.13) B = SoXoGq^X^X -SX, 

where, for the case of left-tail Fdr's, Sq and S are lower triangular matrices, 

(5.14) Ski = ^ and S^ut = ^ for £ < fe. 

Simulation (3.9) for Table 1 was extended to assess the covariance formula 
(5.9). Table 4 compares the observed standard deviations of logfdr(z), now 
from 250 trials, with the average estimates sd obtained from the square root 
of the diagonal elements of (5.9). The formula is quite accurate, especially 
in the empirical null situation; sd was reasonably stable from trial to trial, 
with coefficient of variation less than 10% for 2.5 < z < 3.5. 

The "fdr" column in Table 4 is fdr(z), (2.6), based on /o(z) ~ Ar(0, 1), 
fi{z) ~ A^(3,2) and po = 0.9 as implied by model (3.9). Decisions between 
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null versus nonnull are most difficult in the crucial range 2.5 < z < 3.5, where 
fdr(2;) declines from 0.38 to 0.03. The standard errors for local fdr estimates 
are about one third again larger than for tail area Fdr, when using the 
theoretical null. Both give stable estimates in Table 4: a 10% coefficient of 
variation might mean an estimated fdr of 0.20 it 0.02, quite tolerable in most 
large-scale testing situations. 

Estimation accuracy is much worse on the empirical null side of the ta- 
ble: a 25% coefficient of variation translates to uncomfortably variable fdr 
estimates such as 0.20 it 0.05. Now tail area Fdr's are about one third more 
variable than local fdr's [and several percent worse still if F{z) in (2.7) is 
estimated by the usual empirical c.d.f. rather than the parametric estimate 
corresponding to f{z)]. Increasing N by factor c decreases standard errors 
by roughly -y/c, so taking N = 6000 would about halve the boldface values 
in Table 4. Reducing the degrees of freedom for estimating f(z) from 7 to 5 
decreased standard errors by about one third. MLE fitting gave about the 
same results as central matching here. 

"Always use the theoretical null" is not practical advice, even if supple- 
mented by permutation methods. The theoretical or permutation null yields 
seriously misleading results for the HIV data, as discussed in [10]. Some form 
of empirical null estimation seems inevitable here, whether using tail area 
or local false discovery rates (or, for that matter, other simultaneous test- 
ing techniques). Of course one should strive for the most efficient possible 
estimation method, and MLE fitting seems to offer some advantages in this 
regard. 

The equivalent of Lemma 1 when using MLE fitting is derived from the 
likelihood (4.12). Some definitions in addition to (4.9) are needed: 



[a,b] 



-xq - So xq- 6o 



(^0 ctq 



(5.15) 



Hp{So,ao)= f zPip{z)dz, p = 0,l,2,3,4, 

J a 



P r 





Hp + p—Hp^i 



Co/ \(J0 



[where Hp = Hp{6o,ao), etc.], and likewise a,b,Hp,Ep for these quantities 
when ((5o,cJo) = ((5o,cto). 

Conditional on A'^o, the number of Zi values observed in [— xo,a;o], the 
second factor in (4.12) is a two-parameter exponential family with bivariate 
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sufficient statistics 

V lo 

Y has expectation {Ei{6o, ao) , E2{do, ao)) and covariance matrix 



(5.16) 



(5.17) 



V = — 

No 



E2 — El 
E^ — E1E2 



E3 — E1E2 \ 
E4- E^ J 



By definition, an estimate of foiz) using MLE fitting depends only on 
the counts "yo" within [— xo,xo], corresponding say to index set /Co, length 
Kq. Let Mq be the 3 x Kq matrix whose kth column equals (1, X/t — Yi, — 
Y2y for A: S /Co- Straightforward but lengthy exponential family calculations 

produce the influence function of £q = {log /q (xk)) with respect to yo, a 
K X Kq matrix. 



(5.18) 



1 



dip _ 



N 



UJV-^ 



(5.20) 



where Ik is a vector of K I's, V the estimated version of (5.17), 

(5.19) J = ^o{l f 

and U the K x 2 matrix with kth row 

I' xk-So _ Hi {xk - Spf -a^ H2- Ho 
V 0=0 #0' 



Uk 







H 







Since d£/dy = XG ^X' as before, (5.9) and (5.18) combine to give dlidv/dy 
for MLE fitting: 

Lemma 2. The influence function of logfdr with respect to y, using 
MLE fitting, is the K x K matrix 



(5.21) 



dmr 



1 



dy NHopo 



No 
N 



UJV 



M-XG~^X', 



where M is the 3x K matrix with kth column (1, — Yi, — 12)' for k G /Co, 
and (0,0,0)' fork(^lCo- 



Delta-method estimates of cov(£fdr) for MLE fitting are obtained from 
Lemma 2 as in (5.11), though the formula does not collapse neatly as in 
(5.9). We can employ Lemmas 1 and 2 to compare central matching with 
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MLE fitting for the sensitivity of Fdr(z) to changes in the count vector y. 
The left panel of Figure 6 compares the two influence functions dlogfdr(z = 
3.45) /dyk, plotted versus bin centerpoint Xk,k = 1,2, . . . , K, ior the prostate 
data [z = 3.45 has fdr(z) = 0.2 for both empirical methods, rather than 
fdr(3.37) = 0.2 using the theoretical null]. MLE fitting used xq = 2 in (4.8), 
accounting for the discontinuities there in its influence curve. The right panel 
shows the "variance spectrum" 

(5.22) 5,(.)=(^il^^)\, k = l,2,...,K, 

Uf. the estimated expectation of y^, (5.2). The delta-method estimate of the 
standard deviation for logfdr(2:) is 

_ / K \ 1/2 

(5.23) sd{z) = \Y^Sk{z)^ 

as in (5.11), so variance is proportional to the area under the curve. In this 
case, MLE fitting has less area and smaller estimated standard deviation. 
If we were to use an empirical null here, rather than the theoretical null of 
Figure 3, this would argue for MLE fitting. 

Let ^ = (po, (5oi S'o)- Results similar to Lemma 1 and Theorem 1 yield 
closed- form expressions for the delta-method estimate of cov(^). For central 
matching, 

(5.24) cov(0 = DG-^D' - E, 
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-E a 3 X 3 matrix with En = 1/N and all other entries 0, and 
(5.25) D=\o 52 25d^ 1 Gq^X'oX, 



.0 

G,Gq,Xq and X as in (5.1), (5.3). 

The corresponding estimate of cov(,^) using MLE fitting is 

(5.26) cov(0 = aba', 
with a and h both 3x3 matrices, 

(5.27) a=fl/f»- f), e-JjLl^Bl^ 



and 

(5.28) h: 



hj ^o\Hq Hi 



PoHo{l-poHo/N), 0' 

0, JV^^J/NqJ' 



J and y as in (5.17), (5.19). Locfdr returns the standard deviation estimates 
of po,Sq and ao based on (5.24) and (5.26). 

Model (3.9) presumes that the null genes are exactly null. Figure 7 is 
based on a more relaxed model: 

r5?9l ? '~Arr,/ n with probability 0.90, 

(5.29) ~ N{fi„ 1) with I ^ ^^^^ probability 0.10, 

i = l,2,...,A^ = 1500. In an observational study this might reflect unob- 
served covariates that jitter even the null cases, as in Section 4 of [8]. In 
terms of the two-class model (2.2), (5.29) amounts to po = 0.90, 

(5.30) /o(z)~iV(0,1.122) and /i(z) ~ iV(3, 2). 

Using the theoretical A^(0, 1) null in situation (5.29) gives misleading re- 
sults: fdr(z) tends to be far too liberal in diagnosing nonnull genes, as shown 
by the beaded curve in Figure 7. Empirical null estimation gives fdr(2:) es- 
timates much closer to the true curve fdr(z) = Pofo{z)/ f{z) from (5.30). 
This just says the obvious: empirical methodology correctly estimates fo{z) 
in (5.30) [central matching gave ((5o,ao) estimates averaging (0.02,1.14)], 
which is the whole point of using empirical nulls. Section 4 of [8] discusses 
what "the correct null" means in this situation, and why it cannot be found 
by the usual permutation methods. 

Our covariance estimates, such as (5.9), were derived assuming indepen- 
dence among the components of z = (z, Z2, ■ ■ ■ , zi\i) (almost true for the 
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Fig. 7. Average local false discovery rates fdr(z), 250 replications of (5.29); heavy curve 
using empirical null; beaded curve theoretical null; dots true fdr, (2.6) and (5.30). Using 
the theoretical null yields far too many nonnull genes, including some for Zi <0. 

prostate data but not the HIV data). However the influence function for- 
mulas have a wider range of applicability. The delta-method estimate of 
covariance 

(5.31) c5v(£Wr) = {d£Mr/dy)cm{y){d£{dr/dyy 

applies just as well to correlated Zj's. What changes is that (5.10) no longer 
represents cov(y). 

The development in Section 3 of [10] suggests that the estimate 

(5.32) cov(y) =diag(P), 

with Pfc = NAf{xk) as in (5.2), is still appropriate in a conditional sense for 
the correlated case. Speaking broadly, employing an empirical null amounts 
to conditioning the estimate fdr(2:) on an approximate ancillary statistic 
("A" in [10]), after which (5.31) and (5.32) gives the appropriate conditional 
covariance. This amounts to using (5.9), or its equivalent for MLE fitting, 
as stated. More careful estimates of cov(y) in (5.3) are available in the 
correlated z situation, but the formulas of this section are at least roughly 
applicable, especially for comparing different estimation techniques. 

6. Summary. Large-scale simultaneous testing situations, with thousands 
of hypothesis tests to perform at the same time, are illustrated by the two 
microarray studies of Figure 1. False discovery rate methods facilitate both 
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size and power calculations, as discussed in Sections 2 and 3, bringing em- 
pirical Bayes ideas to bear on simultaneous inference problems. Two types 
of false discovery rate statistics are analyzed, the more familiar tail area 
Fdr's introduced by Benjamini and Hochberg [3], and local fdr's, which are 
better suited for Bayesian interpretation. Power diagnostics may show, as 
in our examples, that a majority of the nonnull cases cannot be reported 
as "interesting" to the investigators without including an unacceptably high 
proportion of null cases. 

Fdr methods, either local or tail area, are easy to apply when the ap- 
propriate null distribution is known to the statistician from theoretical or 
permutation considerations. However, it may be clear that the theoreti- 
cal/permutation null is incorrect, as with the second histogram of Figure 
1. Section 4 gives four reasons why this might happen, especially in observa- 
tional studies. Two methods of estimating an "empirical null" distribution 
are presented, and formulas for their accuracy derived in Section 5. Using an 
empirical null decreases the accuracy of false discovery rate methods, both 
local and tail area, but is unavoidable in situations like the second microar- 
ray study. Software in R, locfdr, is available through CRAN for carrying out 
all the fdr size and power calculations. 
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